10. Pandas: Creating Series and DataFrames

Pandas: The Cornerstone of Data Analysis in Python

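Pandas is built on top of NumPy and provides high-level tools designed for working with structured data.

DataFrame: a two-dimensional tabular data structure (similar to an Excel sheet)
Series: a one-dimensional labeled array (similar to a single DataFrame column)
Index: the row labels, including support for time-series indexing
Vectorized operations: high-performance computation in the style of NumPy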

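Why Choose Pandas over Excel?

Dimension	Pandas	Excel
Data volume	Limited mainly by available memory	1,048,576 rows per worksheet
Reproducibility	Code can be rerun to reproduce results	Manual steps are hard to trace
Automation	Can be built into pipelines	Requires manual operation
Complex computation	Grouping, aggregation, transformation	Limited functionality
Time series	Purpose-built support	Basic support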

Importing Pandas

Listing 1

# Import Pandas under the alias pd (the community convention)
import pandas as pd
# Import NumPy, which Pandas relies on for numerical computation
import numpy as np

# Check the Pandas version
print(f'Pandas version: {pd.__version__}')

# Check that Pandas works
test_data = pd.Series([1, 2, 3, 4, 5])
print(f'\nTest data:\n{test_data}')

Pandas version: 2.3.3

Test data:
0    1
1    2
2    3
3    4
4    5
dtype: int64

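Series: A One-Dimensional Labeled Array

A Series is Pandas' one-dimensional data structure: like a NumPy array, but with a labeled index.

Syntax: pd.Series(data, index=index, dtype=dtype, name=name)

Values: the underlying data (a NumPy array)
Index: the row labels (strings, numbers, dates, and so on)
Name: the name of the Series (optional)
dtype: the data type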
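
To make the constructor parameters concrete, here is a minimal sketch that builds a Series from a dictionary and then inspects its parts. The tickers AAA, BBB, and CCC and their prices are hypothetical, invented only for this illustration.

```python
import pandas as pd

# Hypothetical closing prices, made up for this example
prices = pd.Series(
    {'AAA': 10, 'BBB': 20, 'CCC': 30},  # dict keys become the index
    dtype='float64',                     # store floats instead of the inferred int64
    name='close',                        # the name is shown in printed output
)

print(prices)
print(prices.index.tolist())   # the index labels
print(prices.to_numpy())       # the underlying NumPy array
print(prices.dtype, prices.name)
```

Passing a dictionary is a convenient alternative to passing the data and index separately; both routes produce the same kind of object.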

⭐ Task 1: Create a Series of Stock Price Changes

Listing 2

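# ⚠️ Original platform code: enter it on the teaching platform exactly as shown (comments excepted); otherwise the platform will not accept the answer
# Task 1: create a Series of stock price changes
import pandas as pd
list_may=[-0.0035,-0.0338,0.0058,-0.0298]  # define the list list_may
name=['中国卫星','中国软件','中国银行','上汽集团']  # define the list name
s=pd.Series(list_may,index=name)  # create the Series s
print(s)  # print the Series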

中国卫星   -0.0035
中国软件   -0.0338
中国银行    0.0058
上汽集团   -0.0298
dtype: float64

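Task 1 Code Walkthrough

list_may: holds the price changes of the four stocks
name: holds the stock names, which are used as index labels
pd.Series(data, index=...): creates a labeled one-dimensional array
In the output, the index (stock names) is on the left and the values (price changes) are on the right

Why the index matters:

Fast lookup: works like key-based lookup in a dictionary
Data alignment: operations align automatically on the index
Time series: date and time indexes are supported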
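
The lookup and time-series points can be sketched with a small date-indexed Series. The three return values and the variable names below are assumptions made up for illustration, not data from the tasks.

```python
import pandas as pd

# Hypothetical returns for three trading days (values are made up)
dates = pd.to_datetime(['2020-05-25', '2020-05-26', '2020-05-27'])
r = pd.Series([0.01, -0.02, 0.03], index=dates, name='ret')

# Lookup by label, much like looking up a dictionary key
one_day = r.loc[pd.Timestamp('2020-05-26')]

# A DatetimeIndex also accepts date-range slices; both endpoints are included
two_days = r.loc['2020-05-25':'2020-05-26']

print(one_day)
print(two_days)
```

Automatic alignment, the remaining point in the list, only shows up when two objects with different indexes are combined.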

Accessing a Series by Index

Listing 3

s = pd.Series(
    [-0.0035, -0.0338, 0.0058, -0.0298],
    index=['中国卫星', '中国软件', '中国银行', '上汽集团'],
    name='涨跌幅'
)

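# Access by label
print(f'中国软件 price change: {s["中国软件"]:.4f}')

# Access by position (use .iloc; s[0] on a non-integer index is deprecated in Pandas 2.x)
print(f'First element: {s.iloc[0]:.4f}')

# Slice the first two elements by position
print(f'First two elements:\n{s.iloc[:2]}')

中国软件 price change: -0.0338
First element: -0.0035
First two elements: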
中国卫星   -0.0035
中国软件   -0.0338
Name: 涨跌幅, dtype: float64

Filtering and Summarizing a Series

Listing 4

s = pd.Series(
    [-0.0035, -0.0338, 0.0058, -0.0298],
    index=['中国卫星', '中国软件', '中国银行', '上汽集团'],
    name='涨跌幅'
)

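# Boolean indexing: select the stocks with negative returns
print('Stocks with negative returns:')
print(s[s < 0])

# Summary statistics
print('\nSummary statistics:')
print(s.describe())

Stocks with negative returns:
中国卫星   -0.0035
中国软件   -0.0338
上汽集团   -0.0298
Name: 涨跌幅, dtype: float64

Summary statistics: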
count    4.000000
mean    -0.015325
std      0.019467
min     -0.033800
25%     -0.030800
50%     -0.016650
75%     -0.001175
max      0.005800
Name: 涨跌幅, dtype: float64

Series Arithmetic and Sorting

Listing 5

s = pd.Series(
    [-0.0035, -0.0338, 0.0058, -0.0298],
    index=['中国卫星', '中国软件', '中国银行', '上汽集团'],
    name='涨跌幅'
)

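# Vectorized arithmetic: double every price change
print('Price changes doubled:')
print(s * 2)

# Sort by value, ascending
print('\nSorted by value (ascending):')
print(s.sort_values())

# Sort by value, descending
print('\nSorted by value (descending):')
print(s.sort_values(ascending=False))

Price changes doubled:
中国卫星   -0.0070
中国软件   -0.0676
中国银行    0.0116
上汽集团   -0.0596
Name: 涨跌幅, dtype: float64

Sorted by value (ascending):
中国软件   -0.0338
上汽集团   -0.0298
中国卫星   -0.0035
中国银行    0.0058
Name: 涨跌幅, dtype: float64

Sorted by value (descending):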
中国银行    0.0058
中国卫星   -0.0035
上汽集团   -0.0298
中国软件   -0.0338
Name: 涨跌幅, dtype: float64

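Label vs. Position: The Key Difference

Access method	Syntax	Description	Example
By label	`s['label']`	Uses the index label	`s['中国卫星']`
By position	`s.iloc[0]`	Uses the integer position	`s.iloc[0]`
Label slice	`s['a':'c']`	Includes both endpoints	`s['中国卫星':'中国银行']`
Position slice	`s.iloc[0:2]`	Excludes the end position	`s.iloc[0:2]`

Common pitfall: a label slice includes the end label, whereas a position slice excludes the end position.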
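
The difference between the two kinds of slice is easiest to see side by side. This is a small sketch on a made-up Series; the labels a to d and the values are illustrative only.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s_demo.loc['a':'c']    # labels a, b, c: the end label is included
by_position = s_demo.iloc[0:2]    # positions 0 and 1: the end position is excluded

print(by_label.tolist())      # [10, 20, 30]
print(by_position.tolist())   # [10, 20]
```

Writing `.loc` and `.iloc` explicitly makes it clear at a glance which rule applies.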

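DataFrame: Two-Dimensional Tabular Data

A DataFrame is Pandas' two-dimensional data structure, similar to an Excel sheet or an SQL table.

Syntax: pd.DataFrame(data, index=index, columns=columns)

Index: the row labels (dates, IDs, and so on)
Columns: the column names (variable names)
Values: a two-dimensional NumPy array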

⭐ Task 2: Create a DataFrame of Stock Price Changes

Listing 6

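# ⚠️ Original platform code: enter it on the teaching platform exactly as shown (comments excepted); otherwise the platform will not accept the answer
# Task 2: create a DataFrame of stock price changes
import pandas as pd
data={  # define the dictionary data, mapping each stock name to its values
    '中国卫星':[-0.0351,0.0172,-0.0035,-0.0246,0.0394],  # data series for "中国卫星"
    '中国软件':[-0.0139,0.0243,-0.0338,0.0146,0.0001],  # data series for "中国软件"
    '中国银行':[-0.0139,0.0243,-0.0338,0.0146,0.0001],  # data series for "中国银行"
    '上汽集团':[0.0212,0.0021,-0.0298,-0.0027,-0.0143]  # data series for "上汽集团"
}  # end of the dictionary
index=['20200525','20200526','20200527','20200528','20200529']  # define the list index
df=pd.DataFrame(data,index)  # create the DataFrame df
print(df)  # print the DataFrame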

            中国卫星    中国软件    中国银行    上汽集团
20200525 -0.0351 -0.0139 -0.0139  0.0212
20200526  0.0172  0.0243  0.0243  0.0021
20200527 -0.0035 -0.0338 -0.0338 -0.0298
20200528 -0.0246  0.0146  0.0146 -0.0027
20200529  0.0394  0.0001  0.0001 -0.0143

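Task 2 Code Walkthrough

data dictionary: the keys are column names (stock names) and the values are lists of data (price changes)
index list: specifies the row labels (trading dates)
pd.DataFrame(data, index): creates a DataFrame from the dictionary; the second positional argument is index
The dictionary keys automatically become the column names

Choosing a row index:

Strings: dates, stock codes, and so on
DatetimeIndex: for time series (recommended)
RangeIndex: the default integer index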
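
Since a DatetimeIndex is the recommended choice for time series, here is a minimal sketch of converting string labels into one. It assumes the labels follow the YYYYMMDD pattern used in Task 2; the names df_str and df_dt are illustrative, and only one column with three rows is kept for brevity.

```python
import pandas as pd

# String-labeled frame in the style of Task 2 (shortened for the sketch)
df_str = pd.DataFrame(
    {'中国卫星': [-0.0351, 0.0172, -0.0035]},
    index=['20200525', '20200526', '20200527'],
)

# Parse the 'YYYYMMDD' strings into timestamps on a copy of the frame
df_dt = df_str.copy()
df_dt.index = pd.to_datetime(df_dt.index, format='%Y%m%d')

print(type(df_dt.index).__name__)        # DatetimeIndex
print(df_dt.index.day_name().tolist())   # weekday name of each row
```

Once the index holds real timestamps, date-aware features such as weekday names or date-range slicing become available, which plain strings cannot offer.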

Creating a DataFrame from a List of Lists

Listing 7

# A list of lists: each inner list is one row of data
data_list = [
    ['中信证券', 24.78, 0.05],
    ['国泰君安', 20.15, -0.02],
    ['海通证券', 14.03, 0.03]
]
# Create the DataFrame and specify the column names
df1 = pd.DataFrame(data_list, columns=['名称', '价格', '涨跌幅'])
print(df1)

     名称     价格   涨跌幅
0  中信证券  24.78  0.05
1  国泰君安  20.15 -0.02
2  海通证券  14.03  0.03

Creating a DataFrame from a List of Tuples

Listing 8

# A list of tuples: each tuple is one row of data
data_tuples = [
    ('600519.SH', '贵州茅台', 1850.00),
    ('000858.SZ', '五粮液', 220.50),
    ('600036.SH', '招商银行', 45.20)
]
# Create the DataFrame and specify the column names
df2 = pd.DataFrame(data_tuples, columns=['代码', '名称', '价格'])
print(df2)

          代码    名称      价格
0  600519.SH  贵州茅台  1850.0
1  000858.SZ   五粮液   220.5
2  600036.SH  招商银行    45.2

Creating a DataFrame from a List of Dictionaries

Listing 9

# A list of dictionaries: each dictionary is one row of data
data_dict_list = [
    {'code': '600519.SH', 'name': '贵州茅台', 'price': 1850.00},
    {'code': '000858.SZ', 'name': '五粮液', 'price': 220.50},
    {'code': '600036.SH', 'name': '招商银行', 'price': 45.20}
]
# The dictionary keys automatically become the column names
df3 = pd.DataFrame(data_dict_list)
print(df3)

        code  name   price
0  600519.SH  贵州茅台  1850.0
1  000858.SZ   五粮液   220.5
2  600036.SH  招商银行    45.2

Creating a DataFrame from a NumPy Array

Listing 10

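# Draw a 5x3 array from the standard normal distribution
# (no seed is set, so the values differ on every run)
arr = np.random.randn(5, 3)

# Create a DataFrame from the NumPy array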
df4 = pd.DataFrame(
    arr,
    index=['row1', 'row2', 'row3', 'row4', 'row5'],
    columns=['A', 'B', 'C']
)
print(df4)

             A         B         C
row1 -0.923488 -0.516186  1.072817
row2  0.944275  1.348396 -0.643253
row3  0.836565  1.770071 -0.093332
row4 -0.789220 -1.067171  0.842535
row5  0.453361  0.911048 -0.314768

Accessing DataFrame Columns

Listing 11

data = {
    '中国卫星': [-0.0351, 0.0172, -0.0035, -0.0246, 0.0394],
    '中国软件': [-0.0139, 0.0243, -0.0338, 0.0146, 0.0001],
    '中国银行': [-0.0139, 0.0243, -0.0338, 0.0146, 0.0001],
    '上汽集团': [0.0212, 0.0021, -0.0298, -0.0027, -0.0143]
}
index = ['20200525', '20200526', '20200527', '20200528', '20200529']
df = pd.DataFrame(data, index=index)

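# Method 1: dot (attribute) access; works only when the column name is a valid
# Python identifier and does not clash with a DataFrame attribute or method
print('Dot access:')
print(df.中国卫星.head(3))

# Method 2: bracket access (works for any column name; recommended)
print('\nBracket access:')
print(df['中国软件'].head(3))

# Method 3: select several columns at once
print('\nMultiple columns:')
print(df[['中国卫星', '中国软件']].head(3))

Dot access:
20200525   -0.0351
20200526    0.0172
20200527   -0.0035
Name: 中国卫星, dtype: float64

Bracket access:
20200525   -0.0139
20200526    0.0243
20200527   -0.0338
Name: 中国软件, dtype: float64

Multiple columns: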
            中国卫星    中国软件
20200525 -0.0351 -0.0139
20200526  0.0172  0.0243
20200527 -0.0035 -0.0338

Accessing DataFrame Rows

Listing 12

# loc: access by label
print('loc, single row:')
print(df.loc['20200526'])

print('\nloc, multiple rows:')
print(df.loc[['20200526', '20200527']])

# iloc: access by position
print('\niloc, first row:')
print(df.iloc[0])

print('\niloc, first 2 rows:')
print(df.iloc[0:2])

loc, single row:
中国卫星    0.0172
中国软件    0.0243
中国银行    0.0243
上汽集团    0.0021
Name: 20200526, dtype: float64

loc, multiple rows:
            中国卫星    中国软件    中国银行    上汽集团
20200526  0.0172  0.0243  0.0243  0.0021
20200527 -0.0035 -0.0338 -0.0338 -0.0298

iloc, first row:
中国卫星   -0.0351
中国软件   -0.0139
中国银行   -0.0139
上汽集团    0.0212
Name: 20200525, dtype: float64

iloc, first 2 rows:
            中国卫星    中国软件    中国银行    上汽集团
20200525 -0.0351 -0.0139 -0.0139  0.0212
20200526  0.0172  0.0243  0.0243  0.0021

Accessing a Single Value

Listing 13

# Using loc (label-based)
value_loc = df.loc['20200526', '中国卫星']
print(f'loc: df.loc["20200526", "中国卫星"] = {value_loc:.4f}')

# Using iloc (position-based)
value_iloc = df.iloc[0, 0]
print(f'iloc: df.iloc[0, 0] = {value_iloc:.4f}')

loc: df.loc["20200526", "中国卫星"] = 0.0172
iloc: df.iloc[0, 0] = -0.0351

Inspecting Basic DataFrame Information

Listing 14

# Shape as (rows, columns)
print(f'Shape: {df.shape}')

# Data types
print('\nData types:')
print(df.dtypes)

# First 3 rows
print('\nFirst 3 rows:')
print(df.head(3))

Shape: (5, 4)

Data types:
中国卫星    float64
中国软件    float64
中国银行    float64
上汽集团    float64
dtype: object

First 3 rows:
            中国卫星    中国软件    中国银行    上汽集团
20200525 -0.0351 -0.0139 -0.0139  0.0212
20200526  0.0172  0.0243  0.0243  0.0021
20200527 -0.0035 -0.0338 -0.0338 -0.0298

DataFrame Summary Statistics

Listing 15

# describe(): summary statistics in a single call
print(df.describe())
# Includes: count, mean, std (standard deviation),
# min, 25%/50%/75% (quartiles), and max

           中国卫星      中国软件      中国银行      上汽集团
count  5.000000  5.000000  5.000000  5.000000
mean  -0.001320 -0.001740 -0.001740 -0.004700
std    0.030368  0.023044  0.023044  0.018995
min   -0.035100 -0.033800 -0.033800 -0.029800
25%   -0.024600 -0.013900 -0.013900 -0.014300
50%   -0.003500  0.000100  0.000100 -0.002700
75%    0.017200  0.014600  0.014600  0.002100
max    0.039400  0.024300  0.024300  0.021200

Transposing a DataFrame

Listing 16

# Transpose: swap rows and columns
print('Shape before transpose:', df.shape)
print('\nTransposed:')
print(df.T)
print('\nShape after transpose:', df.T.shape)

Shape before transpose: (5, 4)

Transposed:
      20200525  20200526  20200527  20200528  20200529
中国卫星   -0.0351    0.0172   -0.0035   -0.0246    0.0394
中国软件   -0.0139    0.0243   -0.0338    0.0146    0.0001
中国银行   -0.0139    0.0243   -0.0338    0.0146    0.0001
上汽集团    0.0212    0.0021   -0.0298   -0.0027   -0.0143

Shape after transpose: (4, 5)

Conditional Filtering: A Single Condition

Listing 17

# Select the trading days on which 中国卫星 rose
print('Trading days with 中国卫星 > 0:')
positive = df[df.中国卫星 > 0]
print(positive)

Trading days with 中国卫星 > 0:
            中国卫星    中国软件    中国银行    上汽集团
20200526  0.0172  0.0243  0.0243  0.0021
20200529  0.0394  0.0001  0.0001 -0.0143

Conditional Filtering: Multiple Conditions and query

Listing 18

# Multiple conditions: combine with & (and); wrap each condition in parentheses
print('中国卫星 > 0 and 中国软件 > 0:')
multi_cond = df[(df.中国卫星 > 0) & (df.中国软件 > 0)]
print(multi_cond)

# query method: a more readable syntax
print('\nFiltering with query:')
query_result = df.query('中国卫星 > 0 and 中国软件 > 0')
print(query_result)

中国卫星 > 0 and 中国软件 > 0:
            中国卫星    中国软件    中国银行    上汽集团
20200526  0.0172  0.0243  0.0243  0.0021
20200529  0.0394  0.0001  0.0001 -0.0143

Filtering with query:
            中国卫星    中国软件    中国银行    上汽集团
20200526  0.0172  0.0243  0.0243  0.0021
20200529  0.0394  0.0001  0.0001 -0.0143
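
Beyond `&`, boolean masks can also be combined with `|` (or) and negated with `~` (not), and `query` can refer to a Python variable through the `@` prefix. The frame df_q, its values, and the threshold below are hypothetical, chosen only to keep the results easy to verify by eye.

```python
import pandas as pd

df_q = pd.DataFrame(
    {'x': [-0.02, 0.01, 0.03], 'y': [0.02, -0.01, 0.04]},
    index=['d1', 'd2', 'd3'],
)

either = df_q[(df_q.x > 0) | (df_q.y > 0)]   # OR: at least one column is positive
not_x = df_q[~(df_q.x > 0)]                   # NOT: x is not positive

threshold = 0.02                              # hypothetical cutoff
above = df_q.query('x > @threshold')          # @ refers to the Python variable

print(either.index.tolist())
print(not_x.index.tolist())
print(above.index.tolist())
```

The parentheses around each comparison are required because `&` and `|` bind more tightly than `>` in Python.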

Series Arithmetic: Automatic Index Alignment

Listing 19

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([5, 15, 25], index=['a', 'b', 'd'])

print('Series 1:', s1.values, 'index:', list(s1.index))
print('Series 2:', s2.values, 'index:', list(s2.index))

# Addition: the indexes are aligned automatically
print('\nAddition (automatic alignment):')
print(s1 + s2)
# a: 15, b: 35, c: NaN (no match in s2), d: NaN (no match in s1)

Series 1: [10 20 30] index: ['a', 'b', 'c']
Series 2: [ 5 15 25] index: ['a', 'b', 'd']

Addition (automatic alignment):
a    15.0
b    35.0
c     NaN
d     NaN
dtype: float64

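Core mechanism: values are combined only where both Series share an index label. A label present in only one Series yields NaN, which is also why the result dtype becomes float64.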
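
When NaN is not the desired result for unmatched labels, the `add` method accepts a `fill_value`. The sketch below reuses the two Series from the listing above and treats a missing entry as 0, which is one possible choice rather than a universal default.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([5, 15, 25], index=['a', 'b', 'd'])

# Treat a label missing from one side as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```

Whether 0 is a sensible substitute depends on the data: it suits additive quantities such as traded volume, but it would be misleading for prices.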

DataFrame Arithmetic

Listing 20

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

# DataFrame addition
print('DataFrame addition:')
print(df1 + df2)

# Scalar arithmetic (broadcasting)
print('\nDataFrame multiplied by the scalar 10:')
print(df1 * 10)

DataFrame addition:
    A   B
0  11  44
1  22  55
2  33  66

DataFrame multiplied by the scalar 10:
    A   B
0  10  40
1  20  50
2  30  60

Statistical Functions: Computing per Column

Listing 21

data = {
    '中国卫星': [-0.0351, 0.0172, -0.0035, -0.0246, 0.0394],
    '中国软件': [-0.0139, 0.0243, -0.0338, 0.0146, 0.0001],
    '中国银行': [-0.0139, 0.0243, -0.0338, 0.0146, 0.0001],
    '上汽集团': [0.0212, 0.0021, -0.0298, -0.0027, -0.0143]
}
index = ['20200525', '20200526', '20200527', '20200528', '20200529']
df = pd.DataFrame(data, index=index)

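# Mean price change of each stock
print('Mean (axis=0, per column):')
print(df.mean())

# Volatility of each stock (sample standard deviation)
print('\nStandard deviation:')
print(df.std())

Mean (axis=0, per column):
中国卫星   -0.00132
中国软件   -0.00174
中国银行   -0.00174
上汽集团   -0.00470
dtype: float64

Standard deviation: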
中国卫星    0.030368
中国软件    0.023044
中国银行    0.023044
上汽集团    0.018995
dtype: float64

Statistical Functions: Computing per Row

Listing 22

# Mean price change across all stocks on each trading day
print('Mean (axis=1, per row):')
print(df.mean(axis=1))

Mean (axis=1, per row):
20200525   -0.010425
20200526    0.016975
20200527   -0.025225
20200528    0.000475
20200529    0.006325
dtype: float64

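axis=0 (default): aggregates down the rows, giving one value per column → a statistic for each stock
axis=1: aggregates across the columns, giving one value per row → a statistic for each trading day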
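
The `axis` argument is easy to mix up, so a tiny example with numbers that can be checked by hand may help. The frame df_axis is made up for this sketch.

```python
import pandas as pd

df_axis = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

per_column = df_axis.sum(axis=0)   # collapses the rows: one value per column
per_row = df_axis.sum(axis=1)      # collapses the columns: one value per row

print(per_column.to_dict())   # {'A': 6, 'B': 60}
print(per_row.tolist())       # [11, 22, 33]
```

One way to remember it: the axis named in the call is the one that disappears from the result.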

Computing Cumulative Returns

Listing 23

# Compute cumulative returns
cum_returns = (1 + df).cumprod() - 1
print('Cumulative returns:')
print(cum_returns)

Cumulative returns:
              中国卫星      中国软件      中国银行      上汽集团
20200525 -0.035100 -0.013900 -0.013900  0.021200
20200526 -0.018504  0.010062  0.010062  0.023345
20200527 -0.021939 -0.024078 -0.024078 -0.007151
20200528 -0.045999 -0.009829 -0.009829 -0.009832
20200529 -0.008412 -0.009730 -0.009730 -0.023991

Steps:

1 + df: price change → growth factor (for example, -0.01 → 0.99)
.cumprod(): cumulative product of the factors
- 1: factor → cumulative return
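
The three steps can be traced on a two-day example that is small enough to verify by hand. The returns of +10% followed by -10% are hypothetical; they are chosen to show why returns are compounded rather than summed.

```python
import pandas as pd

r = pd.Series([0.10, -0.10])           # hypothetical: +10%, then -10%

factors = 1 + r                         # growth factors 1.10 and 0.90
cumulative = factors.cumprod() - 1      # 0.10, then 1.10 * 0.90 - 1 = -0.01

print(cumulative.round(4).tolist())
print(round(r.sum(), 4))                # the plain sum is 0.0 and hides the loss
```

After gaining 10% and then losing 10%, the position is down 1% overall, which the compounded figure captures and the simple sum does not.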

Detecting Missing Values

Listing 24

# Create a DataFrame that contains missing values
df_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

print('Original data:')
print(df_nan)

# Detect missing values
print('\nMissing-value mask (True = missing):')
print(df_nan.isnull())

Original data:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12

Missing-value mask (True = missing):
       A      B      C
0  False  False  False
1  False   True  False
2   True   True  False
3  False  False  False
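
On larger tables a full True/False mask is hard to read, so a common follow-up is to count the missing values per column. This sketch rebuilds the same df_nan so that it runs on its own; the names missing_per_column and total_missing are illustrative.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# True counts as 1 when summed, so this gives the number of NaNs per column
missing_per_column = df_nan.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```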

Dropping Missing Values

Listing 25

df_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

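# Drop rows that contain missing values
print('Rows with missing values dropped:')
print(df_nan.dropna())

# Drop columns that contain missing values
print('\nColumns with missing values dropped:')
print(df_nan.dropna(axis=1))

Rows with missing values dropped:
     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12

Columns with missing values dropped: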
    C
0   9
1  10
2  11
3  12
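
By default `dropna` removes any row containing at least one NaN, which can discard more data than intended. The sketch below shows three parameters that narrow this behavior, applied to the same df_nan; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# how='all': drop a row only when every value in it is missing
all_missing = df_nan.dropna(how='all')

# subset: look for missing values only in the listed columns
a_present = df_nan.dropna(subset=['A'])

# thresh: keep rows that have at least this many non-missing values
at_least_two = df_nan.dropna(thresh=2)

print(len(all_missing))
print(a_present.index.tolist())
print(at_least_two.index.tolist())
```

Here no row is entirely empty, so `how='all'` keeps all four rows, while the other two calls drop only row 2.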

Filling Missing Values

Listing 26

df_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

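# Method 1: fill with 0
print('Filled with 0:')
print(df_nan.fillna(0))

# Method 2: forward fill (carry the last valid value downward)
print('\nForward fill:')
print(df_nan.ffill())

# Method 3: fill with each column's mean
print('\nFilled with the column mean:')
print(df_nan.fillna(df_nan.mean()))

Filled with 0:
     A    B   C
0  1.0  5.0   9
1  2.0  0.0  10
2  0.0  0.0  11
3  4.0  8.0  12

Forward fill:
     A    B   C
0  1.0  5.0   9
1  2.0  5.0  10
2  2.0  5.0  11
3  4.0  8.0  12

Filled with the column mean: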
          A    B   C
0  1.000000  5.0   9
1  2.000000  6.5  10
2  2.333333  6.5  11
3  4.000000  8.0  12

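Chapter Summary

Concept	Description	Key methods
Series	One-dimensional labeled array	`pd.Series()`
DataFrame	Two-dimensional tabular data	`pd.DataFrame()`
Index-based access	By label / by position	`loc` / `iloc`
Conditional filtering	Boolean indexing	`df[condition]` / `query()`
Statistical functions	Per row / per column	`mean()` / `std()`
Missing values	Detect / drop / fill	`isnull()` / `dropna()` / `fillna()`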